The dataset I used is the red-wine quality data set from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv. I ran the first section below to figure out what variables I was dealing with, and then did some domain-specific research on those variables and on winemaking. Given that I don’t drink much wine, it was an uphill learning experience, but I can definitely see why having domain knowledge would not only make this better, but also probably make it go a lot faster. Much of what I learned about the various components of the dataset came from the following sources:
https://en.wikipedia.org/wiki/Acids_in_wine
https://www.decanter.com/learn/wine-terminology/sulfites-in-wine-friend-or-foe-295931/
http://srjcstaff.santarosa.edu/~jhenderson/Sulfur%20Dioxide.pdf
Up front, it’s worth noting a couple of things. First, acids in wine make a difference when it comes to quality, so I would expect the acid variables to matter within the dataset. Similarly, I would expect wines with higher sulfates to have lower quality, on average, because of what I learned regarding the “dulling” of flavor by added sulfates. I would also expect a lower density from wines that have higher alcohol content, since alcohol is less dense than water. Finally, I would expect a lower amount of residual sugar in higher-alcohol wines, since fermentation produces alcohol by converting sugar.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This is just some summary information on the variables. There are some clear outliers in sulphates, chlorides, and the different acidities, and a huge one in residual sugar as well. I may need to correct for those, given how far out they are, though I’m also not sure whether the values are reasonable. It is probably better to keep the information and see what happens with transformations or with reframing the charts.
Quality is not registering as a factor, so I’m going to change that.
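A minimal sketch of that recoding, assuming a toy vector in place of the real `quality` column (the name `quality_f` is made up for illustration, and the levels 3 through 8 are taken from the summary above):

```r
# Toy stand-in for the quality column (first ten values from the str() output)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Convert to an ordered factor so that 3 < 4 < ... < 8 is preserved
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

is.ordered(quality_f)  # TRUE
nlevels(quality_f)     # 6
```

Fixing the levels explicitly keeps all six scores as levels even when a subset of the data happens to contain only some of them.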
##
## FALSE
## 20787
Checking for missing data, of which there does not seem to be any.
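One way to run this kind of check is sketched below on a small hypothetical data frame (`wine_toy` is not the real data). With no missing values, the single FALSE count equals rows times columns, which for the full data set is 1599 * 13 = 20787.

```r
# Toy data frame with no missing values (hypothetical stand-in for the wine data)
wine_toy <- data.frame(alcohol = c(9.4, 9.8, 10.5),
                       pH = c(3.51, 3.20, 3.35))

# is.na() returns a logical matrix; table() counts its TRUE/FALSE cells
na_counts <- table(is.na(wine_toy))
na_counts

# With no missing values, the FALSE count is rows * columns
nrow(wine_toy) * ncol(wine_toy)
```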
Looking at a histogram of fixed acidity, there look to be a couple of cases out there on the end of the tail. I wonder if that’s a characteristic of a particular kind of wine, or if there’s something else going on.
There is another outlier way out in the tail of this plot (volatile acidity). I wonder if it’s related to the fixed acidity. It is worth checking whether a single case is driving these values.
## 'data.frame': 1 obs. of 13 variables:
## $ X : int 1300
## $ fixed.acidity : num 7.6
## $ volatile.acidity : num 1.58
## $ citric.acid : num 0
## $ residual.sugar : num 2.1
## $ chlorides : num 0.137
## $ free.sulfur.dioxide : num 5
## $ total.sulfur.dioxide: num 9
## $ density : num 0.995
## $ pH : num 3.5
## $ sulphates : num 0.4
## $ alcohol : num 10.9
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 1
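It looks like it is just one case with that high volatile acidity - wine 1300. Its fixed acidity (7.6) is close to the median, so it is not one of the cases in the fixed acidity tail. I need to see whether this case carries more or less weight later in the analysis, and whether fixed and volatile acidity are related at all.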
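A quick way to pull out the single most extreme case is sketched here. The data frame is a hypothetical stand-in: only the row for wine 1300 uses values from the output above, and the neighbouring rows are placeholders.

```r
# Hypothetical stand-in; only row 1300 reflects the output above
wine_toy <- data.frame(
  X = c(1299, 1300, 1301),
  fixed.acidity = c(5.7, 7.6, 6.7),
  volatile.acidity = c(0.60, 1.58, 0.86)
)

# which.max() returns the position of the largest value
outlier <- wine_toy[which.max(wine_toy$volatile.acidity), ]
outlier
```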
Also worthwhile to change the binwidth to see if I’m missing anything here.
With a smaller binwidth, it does look like there might be some bimodal element to the distribution of volatile acidity. There also seem to be interesting peaks at certain values. Not sure why that would be, necessarily.
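To illustrate why the bin width matters, here is a sketch using base R's `hist()` on simulated values. The data are random draws, not the wine data, and `bin_counts` is a made-up helper name.

```r
set.seed(1)
# Simulated stand-in for volatile acidity, rounded to two decimals
va <- round(rnorm(500, mean = 0.53, sd = 0.18), 2)
va <- va[va > 0]

# Hypothetical helper: histogram counts on [0, 2] for a given bin width
bin_counts <- function(x, width) {
  hist(x, breaks = seq(0, 2, by = width), plot = FALSE)$counts
}

coarse <- bin_counts(va, 0.10)  # 20 wide bins
fine <- bin_counts(va, 0.01)    # 200 narrow bins

length(coarse)
length(fine)
```

Both sets of counts describe the same observations. The narrower bins just expose spikes at individual recorded values that the wider bins average away.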
Looking at citric acid, it looks pretty skewed, with a lot of wines with almost no citric acid, and then one way out at 1. May be worth a log transform.
Changing the binwidth, we can see that the distribution looks a bit different, with a large number of cases at 0. We can also see spikes at about .25 and .5, again making me wonder if we’re seeing the effects of some other variable or something intrinsic to specific kinds of red wine.
Also going to check the distribution after a log-transform.
Looks like there are two “humps” in the distribution. One at 0 and then one at around .1-.17. Not sure if there are multiple varieties of acidity in wine, but this suggests there may be.
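Because many wines have a citric acid value of exactly 0, a plain log transform would produce `-Inf` for them. The sketch below assumes a shifted transform, `log10(x + 1)`. That is a guess at what was done rather than something stated in the report, though it would be consistent with a hump at 0 on the transformed scale. The input values are illustrative.

```r
# Illustrative citric acid values, including exact zeros
citric <- c(0, 0, 0.04, 0.26, 0.49, 1.00)

plain <- log10(citric)        # zeros become -Inf
shifted <- log10(citric + 1)  # zeros stay at 0

plain
round(shifted, 3)
```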
I created a total acidity variable to capture the idea that acidity (as a whole) may affect wine quality - this is the distribution of that variable. Better than the individual acid levels as a whole.
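The printed `total.acid` values elsewhere in this report are consistent with summing fixed acidity, volatile acidity, and citric acid. A sketch using three wines whose values appear in the output (IDs 152, 259, and 653), with `wine_toy` as a hypothetical stand-in for the full data frame:

```r
# Three wines whose values appear in the printed output of this report
wine_toy <- data.frame(
  X = c(152, 259, 653),
  fixed.acidity = c(9.2, 7.7, 15.9),
  volatile.acidity = c(0.52, 0.41, 0.36),
  citric.acid = c(1.00, 0.76, 0.65)
)

# Total acidity as the sum of the three acid measures
wine_toy$total.acid <- with(wine_toy,
                            fixed.acidity + volatile.acidity + citric.acid)
wine_toy$total.acid  # 10.72 8.87 16.91
```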
Residual sugar is extremely long-tailed, so I’m going to try a log transform to normalize the distribution and get a better look at it.
The distribution looks better, but the tail is still pretty long. I’m again interested to see what’s going on with those outliers.
## [1] 8
There are 8 wines with a log10 residual sugar greater than 1.1. I’m going to look at their characteristics.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 481 Min. : 5.600 Min. :0.2800 Min. :0.2500
## 1st Qu.:1243 1st Qu.: 5.975 1st Qu.:0.3050 1st Qu.:0.3575
## Median :1436 Median : 9.900 Median :0.4150 Median :0.3800
## Mean :1295 Mean : 8.537 Mean :0.4113 Mean :0.4350
## 3rd Qu.:1476 3rd Qu.:10.200 3rd Qu.:0.5100 3rd Qu.:0.5000
## Max. :1575 Max. :10.600 Max. :0.5400 Max. :0.7800
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :12.90 Min. :0.0540 Min. : 6.00 Min. : 23.00
## 1st Qu.:13.70 1st Qu.:0.0685 1st Qu.:18.75 1st Qu.: 82.00
## Median :13.85 Median :0.1395 Median :48.00 Median : 93.50
## Mean :14.26 Mean :0.1378 Mean :39.12 Mean : 92.75
## 3rd Qu.:15.40 3rd Qu.:0.2072 3rd Qu.:55.00 3rd Qu.: 99.50
## Max. :15.50 Max. :0.2140 Max. :72.00 Max. :160.00
## density pH sulphates alcohol quality
## Min. :0.9957 Min. :3.120 Min. :0.480 Min. : 8.800 3:0
## 1st Qu.:0.9971 1st Qu.:3.160 1st Qu.:0.555 1st Qu.: 8.950 4:1
## Median :1.0024 Median :3.180 Median :0.705 Median : 9.100 5:3
## Mean :1.0006 Mean :3.228 Mean :0.660 Mean : 9.637 6:4
## 3rd Qu.:1.0029 3rd Qu.:3.308 3rd Qu.:0.755 3rd Qu.:10.350 7:0
## Max. :1.0037 Max. :3.390 Max. :0.770 Max. :11.500 8:0
## total.acid
## Min. : 6.440
## 1st Qu.: 6.680
## Median :10.900
## Mean : 9.384
## 3rd Qu.:11.110
## Max. :11.270
Some interesting notes to look at here. Particularly, these wines’ densities are among the higher densities. Also, though they group together in terms of sugar, they have reasonable ranges on the other variables, notably the chlorides. Given the range on wine quality, high sugar does not look to be determinative. I will look for relationships here in the bivariate analyses.
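The subset examined above could be built roughly as follows. This is a sketch on made-up values (`wine_toy` is hypothetical). On the original scale, the log10 threshold of 1.1 corresponds to about 12.6.

```r
# Made-up residual sugar values for illustration
wine_toy <- data.frame(
  X = 1:6,
  residual.sugar = c(1.9, 2.6, 12.9, 15.5, 6.1, 13.9)
)

# Keep only wines above the threshold on the log10 scale
high_sugar <- subset(wine_toy, log10(residual.sugar) > 1.1)

nrow(high_sugar)                    # number of wines past the threshold
summary(high_sugar$residual.sugar)  # their sugar levels
10^1.1                              # the threshold on the original scale
```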
##
## 0.01 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16
## 2 1 23 65 174 306 498 215 136 54 44 7 4 6 6
## 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.33 0.34 0.36 0.37
## 10 3 3 3 4 2 3 4 1 1 2 1 3 2 2
## 0.39 0.4 0.41 0.42 0.46 0.47 0.61
## 1 2 3 4 1 1 2
Looking at the distribution for chlorides rounded to two digits, it’s interesting to note that two wines sit at the very top of the chloride range (0.61). That amount is about 30% higher than the next nearest level (0.47) and much higher than the bulk of the wines. I’m wondering if that’s a good or a bad thing - maybe associated with very high quality? Very low quality? Other factors?
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 152 152 9.2 0.52 1.00 3.4
## 259 259 7.7 0.41 0.76 1.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 152 0.610 32 69 0.9996 2.74
## 259 0.611 8 45 0.9968 3.06
## sulphates alcohol quality total.acid
## 152 2.00 9.4 4 10.72
## 259 1.26 9.4 5 8.87
There doesn’t seem to be a relationship here between extreme amounts of chlorides and wine quality, with both wines lying far into the tail scoring middling in terms of quality.
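The gap between the top two rounded chloride levels can be expressed relative to either level, and the two choices give different percentages. A sketch with illustrative values: only the two top-end readings, 0.610 and 0.611, come from the output above, and the rest are placeholders.

```r
# Illustrative chloride values; only 0.610 and 0.611 come from the output
chlorides <- c(0.071, 0.076, 0.092, 0.098, 0.467, 0.610, 0.611)

# Frequency table of values rounded to two digits
tab <- table(round(chlorides, 2))
tab

# Two highest rounded levels
lv <- sort(as.numeric(names(tab)), decreasing = TRUE)
top <- lv[1]
second <- lv[2]

rel_to_lower <- (top - second) / second * 100  # gap as % of the lower level
rel_to_higher <- (top - second) / top * 100    # gap as % of the higher level
round(c(rel_to_lower, rel_to_higher), 1)
```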
Worth noting that the log of chlorides is much more normal than the original variable. Not surprisingly, it is still slightly skewed with a long positive tail.
Free sulfur dioxide also seems positively skewed, and has some odd characteristics, so I’ll try a log transform here as well to get a better sense of what’s going on in terms of the distribution.
This is pretty interesting, as it seems there is a bimodal distribution with one mode around 1.7 and another around 2.4 when looking at the log. Again, I wonder if this is indicative of particular types of red wine, or if it’s a function that determines quality.
There are some more crazy outliers at the end of the total sulfur dioxide tail. Let’s take a quick look at them.
## [1] 2
## X fixed.acidity volatile.acidity citric.acid
## Min. :1080 Min. :7.9 Min. :0.3 Min. :0.68
## 1st Qu.:1080 1st Qu.:7.9 1st Qu.:0.3 1st Qu.:0.68
## Median :1081 Median :7.9 Median :0.3 Median :0.68
## Mean :1081 Mean :7.9 Mean :0.3 Mean :0.68
## 3rd Qu.:1082 3rd Qu.:7.9 3rd Qu.:0.3 3rd Qu.:0.68
## Max. :1082 Max. :7.9 Max. :0.3 Max. :0.68
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :8.3 Min. :0.05 Min. :37.5 Min. :278.0
## 1st Qu.:8.3 1st Qu.:0.05 1st Qu.:37.5 1st Qu.:280.8
## Median :8.3 Median :0.05 Median :37.5 Median :283.5
## Mean :8.3 Mean :0.05 Mean :37.5 Mean :283.5
## 3rd Qu.:8.3 3rd Qu.:0.05 3rd Qu.:37.5 3rd Qu.:286.2
## Max. :8.3 Max. :0.05 Max. :37.5 Max. :289.0
## density pH sulphates alcohol quality
## Min. :0.9932 Min. :3.01 Min. :0.51 Min. :12.3 3:0
## 1st Qu.:0.9932 1st Qu.:3.01 1st Qu.:0.51 1st Qu.:12.3 4:0
## Median :0.9932 Median :3.01 Median :0.51 Median :12.3 5:0
## Mean :0.9932 Mean :3.01 Mean :0.51 Mean :12.3 6:0
## 3rd Qu.:0.9932 3rd Qu.:3.01 3rd Qu.:0.51 3rd Qu.:12.3 7:2
## Max. :0.9932 Max. :3.01 Max. :0.51 Max. :12.3 8:0
## total.acid
## Min. :8.88
## 1st Qu.:8.88
## Median :8.88
## Mean :8.88
## 3rd Qu.:8.88
## Max. :8.88
What’s interesting, and perhaps quite telling, is that the two wines identified look to be nearly identical except for their total sulfur dioxide levels. That means the only thing differentiating these wines is that level. I wonder if that will affect quality.
Given how skewed the distribution is, I’m going to log transform it to get a better handle on it.
That’s definitely much more normal, with a mean around 3.5 or so, but that small set of outliers is still there. Given that we’ve seen some other outliers in the other distributions, it makes me wonder if they may be related.
That’s a pretty nice looking distribution. I wonder what units density is in…
pH looks about as good as density, and given the shape of the distributions of pH and density, it makes me wonder if there may be some relationship between the two.
Upon some research (wikipedia.org), I found that pH tends to function as a measure of the strength of acidity, though it is possible, if unusual, for a wine to have both low acidity and low pH. That’s definitely something to check out, though I would note that the histograms above for volatile and fixed acidity did not seem to be distributed similarly to pH - though total acidity was closer - so I’m curious how, and whether, those relationships show within this dataset.
This one is also positively skewed. I’ll try a log-transformation.
That distribution looks more normal, but it still has a bit of a tail and there is definitely an outlier near .7 or so. I’m going to check the cases out there.
## [1] 4
## [1] 87 92 93 152
There are 4 wines that have (log) sulphates greater than .6, and their IDs are listed above.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 87.00 Min. :8.60 Min. :0.4900 Min. :0.2800
## 1st Qu.: 90.75 1st Qu.:8.60 1st Qu.:0.4900 1st Qu.:0.2800
## Median : 92.50 Median :8.60 Median :0.4900 Median :0.2850
## Mean :106.00 Mean :8.75 Mean :0.4975 Mean :0.4625
## 3rd Qu.:107.75 3rd Qu.:8.75 3rd Qu.:0.4975 3rd Qu.:0.4675
## Max. :152.00 Max. :9.20 Max. :0.5200 Max. :1.0000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :1.90 Min. :0.110 Min. :19.00 Min. : 69.0
## 1st Qu.:1.90 1st Qu.:0.110 1st Qu.:19.75 1st Qu.:117.0
## Median :1.95 Median :0.110 Median :20.00 Median :134.5
## Mean :2.30 Mean :0.235 Mean :22.75 Mean :118.5
## 3rd Qu.:2.35 3rd Qu.:0.235 3rd Qu.:23.00 3rd Qu.:136.0
## Max. :3.40 Max. :0.610 Max. :32.00 Max. :136.0
## density pH sulphates alcohol quality
## Min. :0.9972 Min. :2.740 Min. :1.950 Min. :9.40 3:0
## 1st Qu.:0.9972 1st Qu.:2.882 1st Qu.:1.950 1st Qu.:9.70 4:1
## Median :0.9972 Median :2.930 Median :1.965 Median :9.85 5:1
## Mean :0.9978 Mean :2.882 Mean :1.970 Mean :9.75 6:2
## 3rd Qu.:0.9978 3rd Qu.:2.930 3rd Qu.:1.985 3rd Qu.:9.90 7:0
## Max. :0.9996 Max. :2.930 Max. :2.000 Max. :9.90 8:0
## total.acid
## Min. : 9.370
## 1st Qu.: 9.370
## Median : 9.375
## Mean : 9.710
## 3rd Qu.: 9.715
## Max. :10.720
Again, no immediately obvious pattern here. The extreme values don’t seem to be having a large impact on any of the other variables - at least not consistently.
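The lookup of the high-sulphate IDs could be sketched as follows. The threshold of .6 only makes sense on the natural log scale (log10 of the maximum, 2.0, is about 0.3), so `log()` is assumed here. The IDs and the value for wine 152 come from the output above. The per-wine values for 87, 92, and 93 are placeholders chosen to agree with the printed summary, and wine 259 is added as a contrast case.

```r
# IDs from the output above; sulphates for 87, 92, 93 are placeholders
wine_toy <- data.frame(
  X = c(87, 92, 93, 152, 259),
  sulphates = c(1.95, 1.95, 1.98, 2.00, 1.26)
)

# Natural log is assumed, since log10(2.0) is only about 0.3
high_ids <- wine_toy$X[log(wine_toy$sulphates) > 0.6]

length(high_ids)
high_ids
```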
Alcohol has sort of a fatter distribution with a positive skew. And again, there seem to be some outliers hanging out in the tail. I’m going to take a look to see if these are the same cases we saw before. It might indicate a relationship if so.
## [1] 1
There is 1 wine with greater than 14 for the amount of alcohol.
##       X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 653 653          15.9             0.36        0.65            7.5     0.096
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 653                  22                   71  0.9976 2.98      0.84    14.9
##     quality total.acid
## 653       5      16.91
Looking at the ID number, this is not the same as the outliers identified above. Not sure what that means yet (and it may mean nothing). May become apparent in the bi-variate or multi-variate analyses. Also interesting to note that in terms of quality, it’s a mid-grade wine. Does this mean that alcohol doesn’t affect quality very much?
Nope, not really…Though it does seem like we are getting somewhat of a bi-modal thing going on - maybe even tri-modal with that bar around 2.4. Possible that certain kinds of red wine have different “regularized” alcohol levels?
I guess I shouldn’t be too surprised that there are mostly mid-grade quality wines, though the top end of the distribution is a bit interesting because there are more 8’s than 3’s. It would be interesting to see how price functioned with these different categories of quality.
The dataset contains 1599 observations on 13 variables, 12 of which have relevance to the analysis (the remaining one, X, seems to be an ID for each wine). All of the variables are numeric, with the exception of quality, which was originally coded as an integer, but is actually an ordered factor with 0 to 10 as possible scores, though no wines score at either end of that spectrum (the observed scores run from 3 to 8).
The main feature of the dataset is quality. Alcohol seems to be important as well, and from both the above analysis, as well as my research, I suspect acidity and sulphates will be important.
Sugar will likely play a role too, and since yeast (as I understand it) cannot live in an environment that is either too basic or too acidic, pH may play a role there as well.
I recoded quality into an ordered factor and created a “total acidity” variable by adding the “acid” variables together (i.e., fixed, volatile, and citric).
Many of the distributions across features were quite skewed, or had outliers. I log-transformed several because of the long positive tail, which helped me get a better understanding of the total distributions. Specifically, I log-transformed sulphates, alcohol, total sulfur dioxide, free sulfur dioxide, chlorides, residual sugars, and citric acid. I also changed the binwidth on several of the histograms because they looked to have specific values that grouped together (sometimes around an integer, often not). I speculate that this might be related to the different types of red wine within the dataset.
There are a number of interesting relationships in the plot above. I’m particularly interested in the relationship between density and pH, as well as between density and alcohol. To clarify some, I’m going to run a correlogram that highlights larger relationships.
Looks like there are several seemingly interesting relationships.
Fixed acidity & citric acid = .67
Fixed acidity & density = .67
Fixed acidity & pH = -.68
Volatile acidity & citric acid = -.55
Citric acid & pH = -.54
Free sulfur dioxide & total sulfur dioxide = .67
Total acidity & pH = -.68
Total acidity & density = .68
Alcohol & density = -.50
Alcohol & quality = .48
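A list like the one above can be pulled out of a Pearson correlation matrix by filtering on the absolute value of r. The sketch below uses simulated variables (not the wine data) in which pH is constructed to fall as fixed acidity rises. The 0.5 cutoff is an assumption.

```r
set.seed(42)
n <- 200
# Simulated variables: pH is built to fall as fixed acidity rises
fixed.acidity <- rnorm(n, 8.3, 1.7)
pH <- 3.9 - 0.07 * fixed.acidity + rnorm(n, 0, 0.11)
alcohol <- rnorm(n, 10.4, 1.1)
sim <- data.frame(fixed.acidity, pH, alcohol)

# Pearson correlation matrix, rounded for readability
cor_mat <- round(cor(sim, method = "pearson"), 2)
cor_mat

# Pairs with |r| >= 0.5, upper triangle only so each pair appears once
strong <- which(abs(cor_mat) >= 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[strong[, "row"]],
           var2 = colnames(cor_mat)[strong[, "col"]],
           r = cor_mat[strong])
```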
The relationship here between citric acid and fixed acidity looks pretty positive, as expected from the correlogram.
The relationship here between fixed acidity and density seems moderate-strong.
The relationship between alcohol and density is negative and relatively linear. Number 653 is unique in that it is both very high alcohol and moderate-high density.
This is not surprising considering what I learned regarding the relationship between acidity and pH. In general, the lower the fixed acidity, the higher the pH. May be interesting to include in a model for quality.
There does seem to be some relationship here between quality and alcohol content. That said, there are fewer high-quality (8) and low-quality (3) wines, so we may just be seeing an effect of decreased range at the ends of quality. Given that all three top categories of quality have 14 as their highest level of alcohol, that strengthens that position. The chart may also indicate that there is a rough division at a quality of 5 in terms of alcohol: before that point it predicts quality, and after it does not.
If we recode quality into a dichotomous variable, let’s see what happens.
It looks like the positive relationship holds, though it is not as clear here.
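A sketch of the dichotomous recode. The cut point used here (6 and above counts as "high") is an assumption, since the report does not state where the split was made. `quality_2` is the name used in the model later on.

```r
# Illustrative quality scores
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed cut point: 6 and above is coded 1 ("high"), otherwise 0
quality_2 <- ifelse(quality >= 6, 1, 0)

# Cross-tabulate to confirm the recode did what was intended
table(quality, quality_2)
```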
Despite the lack of clear relationship, I am interested in the relationship between the sulphates and the quality of the wine, particularly because I did not transform it for the correlations.
The relationship between log-sulphates and quality looks to be non-linear, with middling-quality wines having the highest levels of sulphates. This relationship looks linear, though it does look a bit heteroskedastic.
Definitely can see a strong negative relationship between volatile acidity and citric acid. This is interesting because it’s the opposite of the relationship between fixed acidity and citric acid. I expect that (perhaps) citric acid is a fixed acid.
Looks like as citric acid increases, pH is lower, only it’s pretty moderate. It seems that though citric acid is present, it (and therefore fixed acidity) may not be what drives the “acidic” element of wine flavor (usually judged by pH).
Looks like a very slight negative relationship. I wonder if fixed acidity and volatile acidity are related.
Clearly this is a case where total sulfur dioxide includes free sulfur dioxide. Important to only include one if I build a model.
Total acid (which I calculated earlier) and pH have a negative relationship. Again, given what I know about the relationship between acidity and pH, this makes sense.
That looks like a pretty strong positive relationship between the total acidity and the density.
There were a lot of interesting relationships highlighted in the correlation matrix and the correlogram. Surprisingly, there are not a lot of strong, direct relationships between the principal feature of interest (quality) and other features, aside from alcohol content, which was positively related.
There were several interesting relationships between the other features. Notably, the fixed acidity seems to have a pretty strong relationship with both the density and citric acid levels, as well as a negative relationship with pH. Given what I found out about how pH and acidity interact, it’s not surprising, though I didn’t expect it to correspond with the density of the wine. Perhaps high levels of acidity create a more dense wine.
I’ve highlighted other interesting relationships early in this section, but perhaps the most interesting is the alcohol and density relationship.
The strongest relationship, based on the Pearson’s correlation coefficient, was the negative relationship between fixed acidity and pH, which again is not surprising. Total acidity and pH had a similar Pearson’s coefficient (with both being -.68).
Given the relationships between alcohol and quality and alcohol and density, I wanted to see how quality stratifies across that relationship. It looks like higher quality wines have both a higher level of alcohol and a lower density, on average. Many of the lower-quality wines have the opposite.
Faceting by quality gives a slightly different picture. The relationship holds, but doesn’t seem to be as strong, particularly for the wines in the 3-4 category.
Looking at the relationship between total acidity and density, with the color by level of alcohol, we can see that lower alcohol wines tend to have higher densities across the level of acidity, though it’s strongest at mid-levels of density. I wonder how it is across quality types.
This makes it much more clear. The relationship between alcohol and density is not conditional on the quality level.
Quality goes down with the increase in volatile acidity, regardless of the amount of alcohol.
I also constructed the model, below.
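library(pscl)  # provides pR2() for pseudo R-squared measures
library(car)   # provides vif() for variance inflation factors
# Logistic regression on the dichotomized quality variable
m1 <- glm(quality_2 ~ alcohol + total.acid + density,
          family = binomial(link = 'logit'), data = wine)
summary(m1)
pR2(m1)  # includes McFadden's pseudo R-squared
vif(m1)  # checks for collinearity among the predictors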
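For reference, McFadden's pseudo R-squared can also be computed by hand as one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. The sketch below does this on simulated data, where the outcome depends only on alcohol by construction. It is not a reproduction of the model above, and the coefficients used to simulate are arbitrary.

```r
set.seed(123)
n <- 500
alcohol <- rnorm(n, 10.4, 1.1)
total.acid <- rnorm(n, 9.1, 1.8)

# Simulated binary outcome in which only alcohol matters (an assumption)
p <- plogis(-10 + 0.95 * alcohol)
quality_2 <- rbinom(n, 1, p)
sim <- data.frame(quality_2, alcohol, total.acid)

# Fitted model and intercept-only (null) model
full <- glm(quality_2 ~ alcohol + total.acid,
            family = binomial(link = "logit"), data = sim)
null <- glm(quality_2 ~ 1,
            family = binomial(link = "logit"), data = sim)

# McFadden's pseudo R-squared: 1 - logLik(full) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))
mcfadden
```

Values near 0 mean the predictors add little over simply guessing the base rate.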
There was surprising consistency in the relationships relative to the feature of interest (quality). This suggests that there are pretty strong consistencies in what makes wine good or bad, even within quality “bins.”
Probably the relationship between acidity and density, when coupled with quality, was the most interesting relationship. That or perhaps the level of alcohol given the density, stratified across the wine quality bins.
I did create a model based on the recoding of the quality variable into a dichotomous variable. I used logistic regression to estimate the model, and though I found some significant variables, it was basically driven by alcohol and total acidity (perhaps not too surprising). Density was not significant, though it might be collinear with alcohol, given what I found in the bivariate analyses. Additionally, the McFadden R-squared for the model was pretty poor, at .165. Basically, I’m bad at predicting wine quality.
I think this plot does a good job not only showing the relationship between alcohol and density, but also demonstrating that lower quality wines tend to have lower levels of alcohol, and therefore higher densities.
This actually just emphasizes the first plot, though I think it does a very good job of demonstrating the impact of alcohol on wine quality. The mean alcohol level as a horizontal line is also very helpful for that, I think.
I think this plot is useful in that it shows that the relationship between density and alcohol is not dependent upon the acidity of the wine while also demonstrating that lower quality wines are associated with less alcoholic, and therefore denser, wines.
While I enjoyed this project, it does show me the value of domain knowledge and that the relationships in the data - even in the feature of interest - are not always very clear. This was difficult for me to work with, because I knew absolutely nothing about wine, or its elements in terms of the features here. More broadly speaking, I don’t think learning R has been too difficult through this process. While I sometimes struggle with understanding how to include different parts of a given chart (like a title, or something), I found the online resources available very helpful in terms of answering my own questions. Overall, I think the project helped me grow in my understanding of the structure of this type of programming, and certainly in EDA, though it was frustrating at times.
I made a couple of mistakes early on that I had to correct (notably, not creating the total acidity feature), and the decision to dichotomize the quality variable may have been the wrong one, since it may have been worthwhile to create three categories.
While there are some very interesting things found in this analysis, I do think that another aspect that should be studied is how an additional feature, the type of wine, plays into the idea of quality. Specifically, I’m interested in how alcohol content is related to quality given a specific type of red wine.